Using Compression For Source Based Classification Of
Abstract
This thesis addresses the problem of source-based text classification. In a nutshell, the problem is to classify documents according to "where they came from" rather than the usual "what they contain". From a machine learning perspective, it is a learning problem that falls into two categories: supervised and unsupervised. In the supervised case, the classifier is given example documents with known sources during the training phase. In the testing phase, it is given a document whose source is unknown, and its goal is to pick the most likely source from the set of known sources. In the unsupervised case, the classifier is given only samples of text, and its goal is to detect regularities in the data set; one such goal could be clustering the documents by common authorship. To perform these classification tasks, we intend to use compression as the underlying technique. Compression can be viewed as a predict-encode process in which upcoming tokens are predicted by a model built adaptively from the text seen so far. This source-modelling feature of compression algorithms allows classification by purely statistical means.

Thesis Supervisor: Shafi Goldwasser
Title: RSA Professor of Computer Science and Engineering
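The two settings described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from the thesis: the general-purpose `zlib` compressor stands in for whatever adaptive model the thesis uses, the function names (`extra_bytes`, `classify`, `ncd`) are hypothetical, and the normalized compression distance is one common compression-based similarity measure rather than the method the thesis is known to adopt. For the supervised case, a document is assigned to the source whose training text makes it cheapest to encode; for the unsupervised case, a pairwise distance of this kind could feed any standard clustering routine.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


def extra_bytes(training: bytes, document: bytes) -> int:
    """Additional compressed bytes the document costs once the
    compressor has already seen the training text."""
    return compressed_size(training + document) - compressed_size(training)


def classify(document: str, corpora: Mapping[str, str]) -> str:
    """Supervised case: return the known source whose training text
    lets the document be encoded in the fewest extra bytes."""
    doc = document.encode("utf-8")
    return min(
        corpora,
        key=lambda source: extra_bytes(corpora[source].encode("utf-8"), doc),
    )


def ncd(x: str, y: str) -> float:
    """Unsupervised case: normalized compression distance between two
    texts (an assumed similarity measure; smaller means more similar)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    return (compressed_size(bx + by) - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Toy training texts for two hypothetical sources.
    corpora = {
        "alice": "the quick brown fox jumps over the lazy dog near the quiet river bank " * 3,
        "bob": "gradient descent converges slowly when the learning rate is chosen poorly " * 3,
    }
    print(classify("the lazy dog near the quiet river bank", corpora))
    print(ncd(corpora["alice"], corpora["alice"]), ncd(corpora["alice"], corpora["bob"]))
```

Note that `zlib` only looks back over a 32 KB window, so with realistic corpora a compressor with a longer memory or an explicit statistical model would likely be a better fit; the sketch is meant only to show the shape of the decision rule.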
Publication date: 2014